{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plotting non-geographic data\n",
    "\n",
    "Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading will help exploration of any data dimensions.  Here let's start by plotting `trip_distance` versus `fare_amount` for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load NYC Taxi data\n",
    "\n",
    "(takes a dozen seconds or so...)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('data/nyc_taxi.csv',usecols=['trip_distance','fare_amount','tip_amount','passenger_count'])\n",
    "\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Define a simple plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.plotting import figure, output_notebook, show\n",
    "\n",
    "output_notebook()\n",
    "\n",
    "def base_plot():\n",
    "    p = figure(\n",
    "        x_range=(0, 20),\n",
    "        y_range=(0, 40),\n",
    "        tools='pan,wheel_zoom,box_zoom,reset', \n",
    "        plot_width=800, \n",
    "        plot_height=500,\n",
    "    )\n",
    "    p.xgrid.grid_line_color = None\n",
    "    p.ygrid.grid_line_color = None\n",
    "    p.xaxis.axis_label = \"Distance, miles\"\n",
    "    p.yaxis.axis_label = \"Fare, $\"\n",
    "    p.xaxis.axis_label_text_font_size = '12pt'\n",
    "    p.yaxis.axis_label_text_font_size = '12pt'\n",
    "    return p\n",
    "    \n",
    "options = dict(line_color=None, fill_color='blue', size=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1000 points reveals the expected linear relationship"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samples = df.sample(n=1000)\n",
    "p = base_plot()\n",
    "p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10,000 points show more detailed, systematic patterns in fares and times\n",
    "  \n",
    "Perhaps there are different metering options, along with granularity in how times and fares are counted; in any case, the times and fares do not uniformly populate any region of this space:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "options = dict(line_color='blue', fill_color='blue', size=1, alpha=0.05)\n",
    "samples = df.sample(n=10000)\n",
    "p = base_plot()\n",
    "p.circle(x=samples['trip_distance'], y=samples['fare_amount'], **options)\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datashader reveals additional detail, especially when zooming in\n",
    "\n",
    "You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datashader as ds\n",
    "from datashader.bokeh_ext import InteractiveImage"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = base_plot()\n",
    "pipeline = ds.Pipeline(df, ds.Point(\"trip_distance\", \"fare_amount\"))\n",
    "InteractiveImage(p, pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we're using the default histogram-equalized color mapping function to reveal density differences across this space.  If we used a linear mapping, we can mainly see that there are a lot of values near the origin, but all the rest are colored the same minimum (defaulting to light blue) color:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datashader import transfer_functions as tf\n",
    "import functools as ft\n",
    "color_fn = ft.partial(tf.shade,how='linear')\n",
    "\n",
    "p = base_plot()\n",
    "pipeline = ds.Pipeline(df, ds.Point(\"trip_distance\", \"fare_amount\"), color_fn=color_fn)\n",
    "InteractiveImage(p, pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fares are discretized to the nearest 50 cents, making patterns less visible, but there is both an upward trend in tips as fares increase (as expected), but also a large number of tips higher than the fare itself, which is surprising:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = base_plot()\n",
    "p.xaxis.axis_label = \"Fare, $\"\n",
    "p.yaxis.axis_label = \"Tip, $\"\n",
    "pipeline = ds.Pipeline(df, ds.Point(\"fare_amount\", \"tip_amount\"))\n",
    "InteractiveImage(p, pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interestingly, tips go down when the number of passengers is greater than 1:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datashader as ds\n",
    "from datashader.bokeh_ext import InteractiveImage\n",
    "from bokeh.models import Range1d\n",
    "\n",
    "p = base_plot()\n",
    "p.xaxis.axis_label = \"Passengers\"\n",
    "p.yaxis.axis_label = \"Tip, $\"\n",
    "p.x_range = Range1d(-0.5, 6.5)\n",
    "p.y_range = Range1d(0, 60)\n",
    "\n",
    "pipeline = ds.Pipeline(df, ds.Point(\"passenger_count\", \"tip_amount\"), width_scale=0.035)\n",
    "InteractiveImage(p, pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we've reduced the resolution along the x axis so that instead of getting isolated points for this inherently discrete data, you can see more-visible horizontal line segments.\n",
    "\n",
    "The above plots all use Bokeh directly, but a much wider range of interactive plots can be built easily using the separate [HoloViews](http://holoviews.org) library, which builds Bokeh and Matplotlib plots from high-level specifications.  For instance, Datashader currently only provides 2D aggregates, but you can easily make a zoomable one-dimensional histogram using HoloViews to dynamically collapse across a second dimension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result=None\n",
    "try:\n",
    "    import numpy as np\n",
    "    import holoviews as hv\n",
    "    from holoviews.operation.datashader import aggregate\n",
    "    hv.notebook_extension('bokeh')\n",
    "\n",
    "    %opts Curve [width=800]\n",
    "    \n",
    "    dataset = hv.Dataset(df, kdims=['fare_amount', 'trip_distance'], vdims=[]).select(fare_amount=(0,60))\n",
    "    agg = aggregate(dataset, aggregator=ds.count(), streams=[hv.streams.RangeX()], x_sampling=0.5, width=500, height=2)\n",
    "    result = agg.map(lambda x: x.reduce(trip_distance=np.sum), hv.Image)\n",
    "    \n",
    "except ImportError: pass\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here datashader is aggregating over both fare_amount and trip_distance, but trip_distance was specified to have only a height of 2, because it will be further collapsed to create the histogram being displayed.  You can now use the wheel zoom tool when hovering over the x axis, and the plot will zoom in or out, dynamically resampling at the given location to make a new histogram (as long as there is a live Python server running). \n",
    "\n",
    "In this particular plot, there is a very wide range of fare amounts, with an implausibly high maximum fare of over 4000 dollars, but you can easily zoom in to the bulk of the data to show that nearly all fares are between 4 and 20 dollars, following something like a gamma distribution, and they are discretized to the nearest 50 cents in this dataset."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
